Final Project - Analyzing Sales Data
Date: 27 January 2023
Author: Tongtai mahasuwan
Course: Pandas Foundation
# import data
import pandas as pd
df = pd.read_csv("sample-store.csv")
# preview top 5 rows
df.head(5)
# shape of dataframe
df.shape
(9994, 21)
# see data frame information using .info()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Row ID 9994 non-null int64
1 Order ID 9994 non-null object
2 Order Date 9994 non-null object
3 Ship Date 9994 non-null object
4 Ship Mode 9994 non-null object
5 Customer ID 9994 non-null object
6 Customer Name 9994 non-null object
7 Segment 9994 non-null object
8 Country/Region 9994 non-null object
9 City 9994 non-null object
10 State 9994 non-null object
11 Postal Code 9983 non-null float64
12 Region 9994 non-null object
13 Product ID 9994 non-null object
14 Category 9994 non-null object
15 Sub-Category 9994 non-null object
16 Product Name 9994 non-null object
17 Sales 9994 non-null float64
18 Quantity 9994 non-null int64
19 Discount 9994 non-null float64
20 Profit 9994 non-null float64
dtypes: float64(4), int64(2), object(15)
memory usage: 1.6+ MB
We can use the pd.to_datetime() function to convert the 'Order Date' and 'Ship Date' columns to datetime.
# example of pd.to_datetime() function
pd.to_datetime(df['Order Date'].head(), format='%m/%d/%Y')
# TODO - convert order date and ship date to datetime in the original dataframe
df['Order Date'] = pd.to_datetime(df['Order Date'], format='%m/%d/%Y')
df['Ship Date'] = pd.to_datetime(df['Ship Date'], format='%m/%d/%Y')
df[['Order Date','Ship Date']].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Order Date 9994 non-null datetime64[ns]
1 Ship Date 9994 non-null datetime64[ns]
dtypes: datetime64[ns](2)
memory usage: 156.3 KB
# TODO - count NaN values per column (Postal Code is the only column with NaN) and check for empty strings
df.isnull().sum()
(df == '').sum()
# TODO - filter rows with missing values
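A sketch for the missing-value filter TODO above. The small inline frame stands in for `df` (the same boolean mask applies directly to the real dataframe); the city names and postal codes here are made up for illustration.

```python
import pandas as pd
import numpy as np

# Small demo frame standing in for df; one row has a missing Postal Code
demo = pd.DataFrame({
    "City": ["Burlington", "Austin", "Seattle"],
    "Postal Code": [np.nan, 73301.0, 98101.0],
})

# Keep rows with at least one missing value in any column
missing_rows = demo[demo.isnull().any(axis=1)]
print(missing_rows)
```

On the real dataset, `df[df.isnull().any(axis=1)]` would return the 11 rows whose Postal Code is NaN.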
# TODO - explore this dataset on your own; ask your own questions
Data Analysis Part
Answer the 10 questions below to get credit for this course. Write pandas code to find the answers.
# TODO 01 - how many columns, rows in this dataset
df.shape
(9994, 21)
# TODO 02 - are there any missing values? If so, in which column, and how many NaN values?
df.isnull().sum()
# TODO 03 - your friend asks for `California` data; filter it and export a csv for him
store_California = df[df['State'] == 'California']
store_California.to_csv('store_California.csv')
# TODO 04 - your friend asks for all order data in `California` and `Texas` in 2017 (look at Order Date); send him a csv file
# parentheses around the state conditions are required, since & binds tighter than |
store_California_Texas_2017 = df[((df['State'] == 'Texas') | (df['State'] == 'California')) & (df['Order Date'].dt.year == 2017)]
store_California_Texas_2017.to_csv('store_California_Texas_2017.csv')
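The precedence pitfall in the filter above is easy to see on a toy frame: without the extra parentheses, `&` binds tighter than `|`, so every Texas row passes regardless of year. The states and years below are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "State": ["Texas", "Texas", "California", "California"],
    "Year":  [2016,    2017,    2016,         2017],
})

# Without parentheses: Texas OR (California AND 2017) -> Texas rows from any year slip through
wrong = toy[(toy["State"] == "Texas") | (toy["State"] == "California") & (toy["Year"] == 2017)]

# With parentheses: (Texas OR California) AND 2017 -> the year filter applies to both states
right = toy[((toy["State"] == "Texas") | (toy["State"] == "California")) & (toy["Year"] == 2017)]

print(len(wrong), len(right))
```

`wrong` keeps three rows (both Texas years plus California 2017); `right` keeps only the two 2017 rows.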
# TODO 05 - how much total sales, average sales, and standard deviation of sales your company make in 2017
df['Sales'][df['Order Date'].dt.year == 2017].agg(['sum','mean','std'])
# TODO 06 - which Segment has the highest profit in 2018
df[df['Order Date'].dt.year == 2018].groupby('Segment')['Profit'].agg("sum").sort_values(ascending =False).head(1)
# TODO 07 - which top 5 States have the least total sales between 15 April 2019 - 31 December 2019
# sort ascending and include both boundary dates to get the five lowest-selling states
df[(df['Order Date'] >= '2019-04-15') & (df['Order Date'] <= '2019-12-31')].groupby('State')['Sales'].agg('sum')\
.sort_values(ascending=True).head(5)
# TODO 08 - what is the proportion of total sales (%) in West + Central in 2019 e.g. 25%
Sale_wc_2019 = df[((df['Region'] == 'West') | (df['Region'] == 'Central')) & (df['Order Date'].dt.year == 2019)]['Sales'].agg('sum')
Sale_2019 = df[(df['Order Date'].dt.year == 2019)]['Sales'].agg('sum')
print(((Sale_wc_2019/Sale_2019) *100),"%")
54.97479891837763 %
# TODO 09 - find top 10 popular products in terms of number of orders vs. total sales during 2019-2020
# popularity measured by total units sold per product
df[(df['Order Date'].dt.year >= 2019) & (df['Order Date'].dt.year <= 2020)].groupby('Product Name')['Quantity']\
.agg('sum').sort_values(ascending=False).head(10)
# top products by total sales
df[(df['Order Date'].dt.year >= 2019) & (df['Order Date'].dt.year <= 2020)].groupby('Product Name')['Sales']\
.agg('sum').sort_values(ascending=False).head(10)
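Note that summing `Quantity` counts units, not orders. If "number of orders" is meant literally, counting distinct `Order ID` values per product is closer; a sketch on made-up order data (the IDs, products, and quantities below are illustrative):

```python
import pandas as pd

# Tiny stand-in for df: one order (A1) contains the same product on two lines
orders = pd.DataFrame({
    "Order ID":     ["A1", "A1", "A2", "A3"],
    "Product Name": ["Pen", "Pen", "Pen", "Ink"],
    "Quantity":     [2, 3, 1, 5],
})

# Distinct orders per product vs. total units per product
order_counts = orders.groupby("Product Name")["Order ID"].nunique()
unit_counts = orders.groupby("Product Name")["Quantity"].sum()
print(order_counts["Pen"], unit_counts["Pen"])
```

"Pen" appears in 2 distinct orders but sold 6 units, so the two rankings can differ.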
# TODO 10 - plot at least 2 plots, any plot you think interesting :)
df[['Sales', 'Profit']].plot(x='Sales', y='Profit', kind="scatter", color="lightblue")
<Axes: xlabel='Sales', ylabel='Profit'>
df['Segment'].value_counts().plot(kind='bar')
<Axes: >
# TODO Bonus - use np.where() to create new column in dataframe to help you answer your own questions
# My question: group observations according to profit.
df['Profit'].plot(kind='box')
<Axes: >
import numpy as np
df['New_Profit'] = np.where(df['Profit'] < 0,"Very Bad","Bad")
df['New_Profit'] = np.where(df['Profit'] > 25,"Ok",df['New_Profit'])
df['New_Profit'] = np.where(df['Profit'] > 100,"Good",df['New_Profit'])
df['New_Profit'].value_counts()
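The chained np.where calls above can also be written with pd.cut, which assigns all four labels in one step. A sketch on a few sample profit values (the numbers are illustrative; bin edges match the np.where thresholds, though pd.cut's right-closed bins put a profit of exactly 0 in "Very Bad" rather than "Bad"):

```python
import pandas as pd

profits = pd.Series([-10.0, 5.0, 50.0, 200.0])

# Same four buckets as the np.where chain: <0, 0-25, 25-100, >100
labels = pd.cut(
    profits,
    bins=[float("-inf"), 0, 25, 100, float("inf")],
    labels=["Very Bad", "Bad", "Ok", "Good"],
)
print(labels.tolist())
```

On the real dataframe this would be `df['New_Profit'] = pd.cut(df['Profit'], ...)`, replacing the three np.where lines.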